ABSTRACT 

The number of sequenced plant genomes and 
associated genomic resources is growing rapidly 
with the advent of both an increased focus on 
plant genomics from funding agencies, and the 
application of inexpensive next-generation 
sequencing. To interact with this increasing body 
of data, we have developed Phytozome 
(http://www.phytozome.net), a comparative hub for plant 
genome and gene family data and analysis. 
Phytozome provides a view of the evolutionary 
history of every plant gene at the level of 
sequence, gene structure, gene family and genome 
organization, while at the same time providing 
access to the sequences and functional annotations 
of a growing number (currently 25) of 
complete plant genomes, including all the land 
plants and selected algae sequenced at the Joint 
Genome Institute, as well as selected species 
sequenced elsewhere. Through a comprehensive 
plant genome database and web portal, these data 
and analyses are available to the broader plant 
science research community, providing powerful 
comparative genomics tools that help to link 
model systems with other plants of economic and 
ecological importance. 

INTRODUCTION 

Plant genome databases have grown up around different 
plant clades [e.g. TAIR for Arabidopsis (1), Gramene for 
grasses (2), SGN for Solanaceae (3), GDR for Rosaceae 
(4), LIS for legumes (5)]. This is in part due to the 
longstanding investment in plant genetic and physical 
mapping resources and the focus of breeding programs 
on different agricultural crops. Comparative genomic 
databases that sample widely across the Viridiplantae 
[Phytozome, GreenPhylDB (6), Plaza (7), PlantGDB (8)] 
are a more recent development. These databases and 
associated web portals provide, at a minimum, a 
uniform set of tools and automated analyses across a 
wider range of plant genomes. In addition, those focused 
on green plant comparative genomics (GreenPhylDB, 
Plaza and Phytozome) provide putative gene families 
(groups of extant genes descended from a common ancestral 
gene) calculated at one or more speciation nodes in 
the plant tree of life, spanning most if not all hosted 
species, as well as additional gene-centric and 
genome-centric comparative tools. Their goal is to 
provide both a platform for genome-informed investigations 
of plant evolution and a framework for 
transferring functional information from model plants to 
plants of agricultural, industrial and environmental 
importance. 

Phytozome (http://www.phytozome.net), first released 
in 2008, provides a centralized hub that enables users with 
varying degrees of computational sophistication to access 
annotated plant gene families, to navigate the evolutionary 
history of gene families and individual genes, to 
examine plant genes in their genomic context and to assign 
putative function to uncharacterized user sequences. It also 
provides uniform access to plant genomics data sets 
consisting of complete genomes, gene and related (e.g. 
homologous) sequences and alignments, gene functional 
information and gene families, either in bulk or as the 
result of on-the-fly complex queries. The Phytozome web 
portal integrates a number of widely used open source 
components [Lucene, GBrowse (9), Jalview (10), 
BioMart (11), mView (12) and pygr] with custom 
visualization code for gene family search, inspection and 
evaluation. 
DATA SOURCES AND STANDARD ANALYSES 



The v7.0 release of Phytozome contains data and analyses 
for 25 plant genomes, 18 of which were sequenced, 
assembled and partially or completely annotated at the 
JGI (Table 1). The gene-calling procedure for each JGI 
genome is described in detail in the associated genome 
publication, but a general overview of the JGI Plant 
Genome Annotation workflow is provided in 
Supplementary Methods S1. For non-JGI genomes and 
annotations, assembled genome sequences and gene, 
transcript and peptide information are obtained in GFF or 
FASTA format, and subjected to consistency checking. 

For non-JGI genomes, any gene symbols, database 
cross references, deflines and experimentally supported 
functional annotations (e.g. GO, EC) are also obtained. 
In the interests of uniformity of functional annotation, 
automatically generated functional annotations of 
non-JGI genomes are not retained. Protein-coding genes 
from both JGI and non-JGI genomes are then assigned 
PFAM domains (26), KEGG enzyme classification and 
KEGG Orthology assignment (27), KOG assignment 
(28) and Panther classification (29). Gene Ontology 
(GO) (30) assignments are made via pfam2GO mapping 
(31). All gene models and associated annotations are then 
loaded into Phytozome's MySQL database. 

Same-species and near-species EST assemblies and 
Phytozome plant peptides are aligned against each 
genome. Each genome also undergoes whole genome 
alignment against a clade-informative subset of the other 
Phytozome genomes using the VISTA pipeline (32). Gene 
and alignment tracks, as well as VISTA-derived 
genome-wide pairwise DNA alignments, are all accessible 
from Phytozome's GBrowse genome browser. 

GENE FAMILY CONSTRUCTION 

Large-scale, automated gene family construction is typically 
based on distance methods [Phytome (33), PlantTribes 
(34), InParanoid (35), OrthoMCL (36)] or, less frequently, 
distance-plus-character methods [OrthologID (37), 
TreeFam (38)], using a single peptide per locus in each 
genome under consideration. These distance-based 
methods can be broadly separated into two categories: 
those that implicitly (OrthoMCL) or explicitly 
(InParanoid) take into account the Mutual Best Hit 
(MBH) (39) relationship between putatively orthologous 
sequences and its role in setting a threshold for paralog 
accumulation (Supplementary Methods S2), and those 
that do not (Phytome, PlantTribes). 
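
The Mutual Best Hit relationship mentioned above can be illustrated with a minimal Python sketch. It assumes a hypothetical table of pairwise similarity scores (for example BLASTP bit scores) between the peptides of two genomes; the gene names and scores are invented for illustration and are not taken from Phytozome.

```python
# Hypothetical similarity scores: (query, subject) -> score (higher is better).
scores = {
    ("a1", "b1"): 200, ("a1", "b2"): 150,
    ("a2", "b1"): 180, ("a2", "b2"): 90,
    ("b1", "a1"): 200, ("b1", "a2"): 180,
    ("b2", "a1"): 150, ("b2", "a2"): 90,
}

def best_hits(scores, query_genes, subject_genes):
    """Map each query gene to its top-scoring gene in the other genome."""
    best = {}
    for q in query_genes:
        candidates = [(scores[(q, s)], s) for s in subject_genes if (q, s) in scores]
        if candidates:
            best[q] = max(candidates)[1]
    return best

def mutual_best_hits(scores, genome_a, genome_b):
    """Return the pairs (a, b) in which a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(scores, genome_a, genome_b)
    b_to_a = best_hits(scores, genome_b, genome_a)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

mbh = mutual_best_hits(scores, ["a1", "a2"], ["b1", "b2"])
print(mbh)
```

In this toy data set a2 also hits b1 most strongly, but b1's best hit is a1, so only the pair (a1, b1) qualifies as an MBH.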

Distance-based methods have the advantage of being 
generally fast and scalable. Their main disadvantage lies 
in their reliance on a single score to characterize the 
evolutionary divergence of sequences, which becomes more 
problematic when considering species with an ancient 
divergence. In that case BLASTP scores tend to lose their 
resolving power, leading either to the accumulation of 
unrelated, weakly aligning sequences into families at low 
significance thresholds, or to the exclusion of distant but 
true homologs at higher significance thresholds. 

Distance-plus-character-based methods use distance 
scores and a simple threshold to build an initial set of 
gene proto-families, all of whose members are more 
similar to each other than the threshold (e.g. 
OrthologID currently employs an E-value threshold of 
1e-20). The members of each family are then included 
in a multiple sequence alignment (MSA), and phylogenetic 
trees are constructed based on discriminating residues 
(characters) in the MSA. The actual gene families 
correspond to the various monophyletic nodes found in 
the resulting trees. Phylogenetic methods are traditionally 
thought to be more accurate, especially when looking at 
anciently diverged species. However, recent work (40) on 
Drosophila and fungal species that span evolutionary 
distances comparable to the eudicots has shown that a wide 
range of tree-building methods fail >50% of the time to 
produce the correct tree topology for even simple gene 
families, indicating the need for caution when making 
ortholog/paralog assignments based on gene and species 
tree reconciliation. 

Table 1. The 25 completed plant genomes in version 7 of Phytozome 

Organism                    Common name              Version
Aquilegia coerulea          Colorado blue columbine  JGI v1.0
Arabidopsis lyrata          Lyre-leaved rock cress   JGI v1.0 (13)
Arabidopsis thaliana        Thale cress              TAIR v10 (1)
Brachypodium distachyon     Purple false brome       JGI/MIPS v1.0 (14)
Carica papaya               Papaya                   ASGPB release of 2007 (15)
Chlamydomonas reinhardtii   Green alga               JGI assembly v4 with Augustus update 10.2 annotation (16)
Citrus clementina           Clementine               JGI v0.9
Citrus sinensis             Sweet orange             JGI/U Florida v1 assembly and v1.1 annotation
Cucumis sativus             Cucumber                 Roche 454-XLR assembly and JGI v1.0 annotation
Eucalyptus grandis          Eucalyptus               JGI v1.0
Glycine max                 Soybean                  JGI Glyma1 assembly and Glyma 1.0 annotation (17)
Manihot esculenta           Cassava                  JGI/Roche/U. Arizona v4 assembly and v4.1 annotation
Medicago truncatula         Barrel medic             Medicago Genome Sequence Consortium version Mt3.0
Mimulus guttatus            Monkey flower            JGI v1.0 release of strain IM62
Oryza sativa                Rice                     MSU Release 6.0 (18)
Physcomitrella patens       Moss                     JGI assembly v1.1 and COSMOSS annotation v1.6 (19)
Populus trichocarpa         Poplar                   JGI assembly v2.0, annotation v2.2 (20)
Prunus persica              Peach                    JGI v1.0
Ricinus communis            Castor bean              TIGR Release 0.1
Selaginella moellendorffii  Spikemoss                JGI v1.0 (21)
Setaria italica             Foxtail millet           JGI assembly v2.0, annotation version 2.1
Sorghum bicolor             Sweet sorghum            JGI v1.0 assembly, MIPS/PASA Sbi1.4 models (22)
Vitis vinifera              Grapevine                Genoscope March 2010 annotation on 12X assembly (23)
Volvox carteri              Volvox                   JGI v1.0 (24)
Zea mays                    Maize                    Unfiltered protein coding models from Maizesequence.org release 5a.59 (25)

For published genomes, references are included in the version column. 

Whatever construction method is chosen, each gene 
family and its associated phylogenetic tree represents a 
hypothesis of the evolutionary history of a set of extant 
genes, presumed to be descendants of a single, unobservable 
ancestral gene. Descendants arise either via speciation 
(giving rise to orthologous descendants) or via local or 
larger scale duplication events (giving rise to paralogous 
descendants). As orthologs are assumed to be more likely to 
share a common biological function, while paralogs are 
subject to both neo- and subfunctionalization (41,42), 
the high-confidence identification of orthologs allows for 
the transfer of functional information from well-studied, 
tractable model systems (e.g. Arabidopsis and 
Brachypodium) to other economically or otherwise 
relevant plants. 

Gene family construction in Phytozome uses a 
distance-based approach similar to the PhiGs method 
(43), the initial proto-family creation step used in 
TreeFam, with several modifications (Supplementary 
Methods S2). Family construction is restricted initially 
to a subset of core genomes, which are assumed to have 
relatively stable assemblies and complete structural 
annotations, though in some cases genomes with draft 
assemblies and annotations are used if the species in 
question is the sole representative of its clade (e.g. 
Selaginella, Physcomitrella, Mimulus). Using the assumed 
species tree, gene families are constructed at each 
evolutionary node, starting from the crown nodes [as in (44)] 
and moving backward in evolutionary time. At each 
bifurcating parent node, pairs of gene families from the 
two daughter nodes are combined into a parent family if 
they are joined by a cross-node MBH. Remaining families 
from the daughter nodes are added to a parent family 
as paralogs if they have a hit to the parent that is stronger 
than the parent's best outgroup hit. This process is 
repeated down to the root node. MSAs from MUSCLE (45) 
and Hidden Markov Model (HMM) profiles from 
HMMER3 (46) are created for each core family. These 
profiles are used to 'pledge' peptides from non-core 
genomes into existing core families using HMMScan (46); 
non-core peptides can also join core families if they are 
linked by an MBH. Non-core members can pledge to multiple 
families at a given node; thus the strict nesting of gene 
families is true for the core members only. 
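
The merging step at a single bifurcating node can be sketched in Python as follows. This is a simplified illustration of the procedure described above, not the Phytozome implementation (the actual modifications are given in Supplementary Methods S2). The function name, the input data structures and the choice to carry unattached daughter families forward as separate families are all assumptions made for the example.

```python
def merge_at_parent(left, right, mbh_pairs, score, outgroup_best):
    """Combine daughter-node families into parent-node families (sketch).

    left, right   : lists of sets of gene ids (families at the two daughter nodes)
    mbh_pairs     : set of (left_gene, right_gene) cross-node mutual best hits
    score         : dict (gene, gene) -> similarity score (hypothetical input)
    outgroup_best : dict gene -> score of that gene's best outgroup hit
    """
    families = [set(f) for f in list(left) + list(right)]
    family_of = {g: i for i, fam in enumerate(families) for g in fam}

    # Step 1: daughter families joined by a cross-node MBH seed a parent family.
    parent_of = {}  # daughter family index -> parent family (set of genes)
    for g_left, g_right in sorted(mbh_pairs):
        i, j = family_of[g_left], family_of[g_right]
        merged = (families[i] | families[j]
                  | parent_of.get(i, set()) | parent_of.get(j, set()))
        for k, fam in enumerate(families):
            if fam <= merged:
                parent_of[k] = merged
    seeds = list({id(p): p for p in parent_of.values()}.values())

    # Step 2: a remaining family joins a parent as paralogs if one of its genes
    # hits a parent member more strongly than that member's best outgroup hit.
    attachments, leftovers = [], []
    for k, fam in enumerate(families):
        if k in parent_of:
            continue
        best_score, best_seed = 0.0, None
        for g in fam:
            for n, seed in enumerate(seeds):
                for p in seed:
                    s = score.get((g, p), 0.0)
                    if s > outgroup_best.get(p, 0.0) and s > best_score:
                        best_score, best_seed = s, n
        if best_seed is None:
            leftovers.append(fam)  # assumption: kept as its own family
        else:
            attachments.append((best_seed, fam))

    result = [set(seed) for seed in seeds]
    for n, fam in attachments:
        result[n] |= fam
    return sorted(sorted(f) for f in result + leftovers)


# Invented toy data: two daughter nodes with two families each.
parent_families = merge_at_parent(
    left=[{"At1", "At2"}, {"At3"}],
    right=[{"Os1"}, {"Os2"}],
    mbh_pairs={("At1", "Os1")},
    score={("At3", "Os1"): 120, ("Os2", "At1"): 40},
    outgroup_best={"Os1": 80, "At1": 60},
)
print(parent_families)
```

Here At3 joins the seeded family because its hit to Os1 (120) exceeds Os1's best outgroup hit (80), whereas Os2 stays separate because its hit to At1 (40) does not exceed At1's best outgroup hit (60).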

Figure 1 shows a typical gene family view, with the basis 
for each gene's membership in the family displayed in the 
leftmost column. A view of this family's evolutionary 
history (Figure 2) shows the hierarchical nesting of the 
core families. 



The use of relatively strict significance and coverage 
thresholds, as well as an insistence on MBH relationships 
rather than simply strong similarity as the basis for 
seeding gene families, is intended to prevent merely 
similar gene families from coalescing at an inappropriate 
node in the tree. It also, however, biases Phytozome 
families towards underclustering. For this reason, 
Phytozome includes a number of search and navigation 
tools, described below, to quickly bring together gene 
families that share overall sequence similarity or functional 
annotation. 



PHYTOZOME TOOLS AND VIEWS 

Text and sequence search 

Genes and gene families can be retrieved from Phytozome 
by both keyword and sequence similarity searches. 
BLAST and BLAT searches of organism genomes, and 
BLAST searches of proteomes and gene family consensus 
sequences, can be used to find the genomic regions, gene 
transcripts, peptides and gene families most similar to a 
given query sequence. All gene and gene family attributes 
such as names, symbols, synonyms, external database 
identifiers, deflines and functional annotation ids (e.g. 
PFAM00071, E.C. 1.1.1.95) are searchable, and gene 
families automatically inherit the attributes of their 
members, making it straightforward to retrieve a family 
of related but mostly uncurated genes as long as at least 
one family member is well annotated. Search can be restricted 
to gene families at a particular evolutionary node, 
and to families matching particular absence/presence 
phylogenetic profiles. One can also search the database 
of functional annotations (e.g. keywords from the descriptions 
of PFAM, GO, KEGG, KOG, Panther), which retrieves 
the set of all matching functional identifiers, and 
then automatically performs a second search for families 
marked as containing those functions. 
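
Attribute inheritance and phylogenetic-profile filtering, as described above, can be sketched in a few lines of Python. The class names, species labels and gene identifiers below are hypothetical; the sketch only shows how a family can be found through one annotated member and then filtered by the presence or absence of species.

```python
from dataclasses import dataclass, field

@dataclass
class Gene:
    gene_id: str
    species: str
    attributes: set = field(default_factory=set)

@dataclass
class Family:
    family_id: str
    members: list

    def attributes(self):
        # A family inherits every attribute of every member.
        return set().union(*(g.attributes for g in self.members))

    def species(self):
        return {g.species for g in self.members}

def search(families, keyword, present=(), absent=()):
    """Return ids of families carrying `keyword` and matching the profile."""
    hits = []
    for fam in families:
        if keyword not in fam.attributes():
            continue
        found = fam.species()
        if set(present) <= found and not (set(absent) & found):
            hits.append(fam.family_id)
    return hits

families = [
    Family("fam1", [Gene("g1", "Athaliana", {"PFAM00071"}),
                    Gene("g2", "Osativa")]),   # g2 is unannotated
    Family("fam2", [Gene("g3", "Athaliana", {"PFAM00071"})]),
]

print(search(families, "PFAM00071"))
print(search(families, "PFAM00071", present=["Osativa"]))
print(search(families, "PFAM00071", absent=["Osativa"]))
```

The unannotated rice gene g2 is retrieved with fam1 because one other member of that family carries the annotation.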

All genes and gene families found via keyword or 
sequence similarity searches can be viewed individually, 
as described below, or first combined 'on the fly' to 
produce composite families, before being viewed and 
analyzed with the same tools used for individual families. 

Gene family and gene page views 

The Gene Family view (Figure 1) provides the user with 
detailed information on each family and its constituent 
members, organized to highlight shared attributes. The 
default 'Genes in this family' tab displays individual 
family members, grouped by species, and includes each 
member's source identifier (hyperlinked to the appropriate 
source database), aliases, synonyms and gene symbols, 
deflines (where available) and a graphical view of each 
member's local syntenic environment. A provisional 
family name is provided, as well as a membership 
'fingerprint' (member count for all species present at this 
node), and family-level KOG and KEGG-Orthology 
classification. The syntenic display can be replaced by a PFAM 
domain or gene structure (exon/intron) display. For each 
family member, links are provided to both a GBrowse 
view (Figure 3) of each gene in its genomic context, and a 
'Gene Page' (Figure 4). 

Figure 1. Default view of the Gene Family page for a 17 member core eudicot family. Members are listed according to their order in the tree on the 
Phytozome home page. The membership class of each gene is indicated in the leftmost column (Supplementary Methods S2). For each member, Gene 
Page and GBrowse links are provided, as well as links to external databases (if these exist), aliases, symbols and deflines. The synteny view in the 
right column shows the five upstream and five downstream neighbors of each family member (which are rendered as gray icons in the middle of each 
synteny row). Each syntenic segment is oriented to render family members in the same orientation (consistent with their presumed descent from a 
common ancestor). Gene icons sharing the same (non-white) color are all members of the same gene family at this node; this can provide syntenic 
support for the hypothesis of a common ancestor for family members. 
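
The synteny rows described in the Figure 1 caption (five upstream and five downstream neighbors, with every segment oriented so that the family member points the same way) can be sketched as below. The input format, a list of (gene id, strand) tuples in chromosomal order, is an assumption made for the example.

```python
def synteny_row(genes, index, flank=5):
    """Return the family member at `index` with up to `flank` neighbors per side.

    genes : list of (gene_id, strand) tuples in chromosomal order (hypothetical)
    If the family member lies on the minus strand, the window is reversed and
    all strands are flipped so that the member is always drawn on the plus strand.
    """
    start = max(0, index - flank)
    window = genes[start:index + flank + 1]
    if genes[index][1] == "-":
        window = [(g, "+" if s == "-" else "-") for g, s in reversed(window)]
    return window

chromosome = [("g1", "+"), ("g2", "-"), ("g3", "+")]
row = synteny_row(chromosome, index=1, flank=1)
print(row)
```

Because g2 is on the minus strand, the row is flipped: g3 now appears to its left and g1 to its right.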

Figure 2. Family History view of the gene family in Figure 1. All the descendants and ancestors of this core eudicot family (which is highlighted) are 
visible in the history view. The strict nesting of families is observed, though one needs to remember that one of the Eucalyptus genes in this core 
eudicot family is an incomplete pledge (Supplementary Methods S2), and is not present in the deeper Embryophyte and Viridiplantae ancestors. 

The family page is divided into a set of lower and upper 
tabs, roughly corresponding to 'information' and 
'actions', respectively. The lower row helps users explore 
the consistency and evolutionary history of the family. 
The 'Functional Annotation' tab lists all the functional 
and domain annotations (e.g. PFAM, Panther, GO, 
KEGG, KEGG Orthology) assigned to family members, 
broken down by organism. Functional annotations 
present in all family members are highlighted. The 'MSA' 
tab displays a pre-computed MUSCLE peptide alignment 
of all family members, which is downloadable. The 
family's evolutionary history can be viewed in the 
'Family History' tab, where all families that are parents 
of, or derived from, the current family are listed. From the 
upper row of tabs, 'Find related families' provides a 
number of methods for identifying families similar to the 
current one: by family consensus sequence similarity, by 
shared functional annotation, or by shared gene membership. 
This is useful when looking for related 
subfamilies, or when verifying that a particular combination of 
domains is unique to a given family. 'Align family members' 
forwards family member coding or peptide sequences 
directly to the Jalview tool, where MSAs can be created 
and edited, and subsequently used to construct phylogenetic 
trees. 'Get Data' provides access to the BioMart data query 
tool for this family, while the family page display can be 
customized on the 'Display options' tab. 
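
One simple way to rank families by shared functional annotation or shared gene membership is set overlap. The sketch below uses the Jaccard index as an illustrative similarity measure; the text does not specify the measure Phytozome uses, and the family names and annotation identifiers are invented.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def related_families(query, families, min_score=0.1):
    """Rank families by overlap with the query set, best first."""
    ranked = [(jaccard(query, members), fid) for fid, members in families.items()]
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [(fid, round(s, 3)) for s, fid in ranked if s >= min_score]

# Hypothetical families described by the annotation ids of their members.
families = {
    "famA": {"PF1", "PF2", "PF3"},
    "famB": {"PF9"},
    "famC": {"PF2"},
}
related = related_families({"PF1", "PF2"}, families)
print(related)
```

The same function works unchanged if each family is described by a set of gene identifiers instead of annotation identifiers.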

The Gene Page (Figure 4), in addition to showing single 
gene functional annotations and evolutionary history, 
includes links to alternatively spliced transcripts (if they 
exist), a simplified view of the gene in its genomic context 
(showing alternatively spliced transcripts and peptide 
homology tracks), direct access to genomic, transcript, 
coding and peptide sequences associated with this gene 
locus (color-coded to indicate exon/intron and UTR 
boundaries), and a graphical view of all other 
Phytozome peptides aligned [via dual affine 
Smith-Waterman (47)] against this gene's peptide. 

Genome-centric views are provided by GBrowse 
(Figure 3) for all 25 genomes currently included in 
Phytozome. The browsers can be accessed directly from 
the Phytozome home page, from individual member gene 
links on the Gene Family or Gene page, and from the 
BLAST/BLAT results page for searches performed 
against one of the genome target databases. In the latter 
two cases, a zoomed-in view of the genomic region 
containing the selected gene (or BLAST hit) is displayed. 
Each browser typically displays a gene prediction track 
(primary and alternatively spliced transcripts), a track of 
homologous peptides from related species aligned against 
the genome, supporting ESTs (or EST assemblies) and one 
or more VISTA tracks identifying regions of this genome 
that are syntenic with other plant genomes included in 
Phytozome. All gene features are hyperlinked to their 
respective Gene Page, while the VISTA tracks are linked to 
the corresponding genomic regions in the VISTA browser. 

Figure 3. GBrowse view of the local genomic context of the poplar gene from the family in Figure 1. Primary and alternative transcripts (if present), 
assembled EST data and related plant peptides are shown aligned against the genome. Not shown are tracks of repetitive regions, GC content and 
the alignment of ESTs from related species. Interspecies whole genome alignments, displayed in the VISTA tracks, reveal the tendency towards 
strong genomic sequence conservation in coding regions (which are under selective pressure), which weakens as one considers more distantly related 
species (e.g. rice-poplar versus the more closely related eucalyptus-poplar VISTA alignments). Displayed gene models are hyperlinked to their 
respective gene pages. 

Figure 4. Default view of the Gene Page for the Arabidopsis thaliana gene in the family of Figure 1, showing primary transcript info, functional 
annotations and simplified genomic context. This locus has an alternative transcript (which appears to differ primarily in its 5'-UTR). Note the 
strong splicing support provided by the BLATX aligned Arabidopsis lyrata peptide (which is in fact also a member of this family). 

DATA ACCESS 

For each genome hosted at Phytozome, bulk data files are 
available that contain genome assembly sequence, gene 
structure GFF3, transcript, coding and peptide sequence 
in FASTA format and general annotation information 
(PFAM, Panther, KOG, KEGG, best rice and 
Arabidopsis homologs). For JGI genomes, we also 
provide repeat-masked genome assemblies, as well as 
supporting annotation data (e.g. the PASA EST assemblies 
used in gene calling). 
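
As an illustration of working with the bulk gene structure files, the following Python sketch parses GFF3 feature lines using only the standard nine-column layout of the format. The example records and identifiers are invented and do not come from a Phytozome file.

```python
from collections import Counter

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

def parse_gff3(lines):
    """Yield one dict per feature line, skipping comments and blank lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(GFF3_COLUMNS):
            continue  # malformed rows are ignored in this sketch
        feature = dict(zip(GFF3_COLUMNS, fields))
        feature["start"] = int(feature["start"])
        feature["end"] = int(feature["end"])
        # Column 9 holds semicolon-separated key=value pairs.
        feature["attributes"] = dict(
            pair.split("=", 1)
            for pair in feature["attributes"].split(";") if "=" in pair
        )
        yield feature

example = [
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t2500\t.\t+\t.\tID=geneA;Name=geneA",
    "chr1\texample\tmRNA\t1000\t2500\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "chr1\texample\tCDS\t1200\t2300\t.\t+\t0\tID=geneA.1.cds;Parent=geneA.1",
]

features = list(parse_gff3(example))
type_counts = Counter(f["type"] for f in features)
# GFF3 coordinates are 1-based and inclusive, hence the +1.
gene_lengths = {f["attributes"]["ID"]: f["end"] - f["start"] + 1
                for f in features if f["type"] == "gene"}
print(type_counts, gene_lengths)
```

For a real file, the list of example lines would be replaced by an open file handle, which `parse_gff3` can iterate over directly.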



Customized data sets consisting of gene or gene family 
sequences and annotations can be constructed using 
Phytozome's implementation of BioMart, where users 
can choose detailed data filters, attributes and output 
formats. BioMart can be accessed from the 'Get Data' 
tab on a gene family page (in which case the data is, by 
default, initially restricted to that gene family), or directly 
from the Phytozome menu. It is also available at the 
BioMart central portal, http://www.biomart.org. 
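
BioMart services are commonly queried with a small XML document that names a dataset, its filters and the attributes to return. The sketch below only builds such a document with the Python standard library and does not contact any server. The dataset, filter and attribute names are placeholders, not the names used by the Phytozome mart, and should be replaced with those shown in the BioMart interface.

```python
import xml.etree.ElementTree as ET

def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart-style XML query element (names are caller-supplied)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(query, "Dataset",
                               {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(dataset_el, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    return query

# Placeholder names, chosen only to illustrate the structure of a query.
query_element = build_biomart_query(
    dataset="example_dataset",
    filters={"example_family_filter": "12345"},
    attributes=["example_gene_name", "example_peptide_sequence"],
)
query_xml = ET.tostring(query_element, encoding="unicode")
print(query_xml)
```

Restricting the query with a family filter mirrors what the 'Get Data' tab does by default when BioMart is opened from a gene family page.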

PHYTOZOME SOFTWARE IMPLEMENTATION 

We have made extensive reuse of available databases, 
software tools and data formats in our implementation 
of Phytozome. The Phytozome website is built on a 
LAMPJ stack (Linux, Apache, MySQL, PHP/Perl and 
Java). Open source visualization components of 
Phytozome include: GBrowse (9), the Generic Genome 
Browser, from the GMOD project, for the visualization 
of features in their genomic context; Jalview (10), a 
multiple alignment viewer and editor, for the creation, 
detailed inspection and modification of MSAs and 
phylogenetic trees; BioMart (11), to enable query-based 
downloads of bulk data on gene families and genome 
annotations; BioPerl (48), for the parsing and formatting 
of genomic data and BLAST results; and mView (12), for 
the visualization of MSAs. The search system is based on 
the Lucene search engine (http://lucene.apache.org/). 

FUTURE PLANS 

Phytozome content will continue to be updated at least 
annually, with new and updated genomes typically 
added in January and new feature sets released quarterly. 
Current plans for the January 2012 (v8) release include 
updates to poplar, soybean, Brachypodium, maize and 
Medicago, the first-time inclusion of the JGI genomes 
Phaseolus (common bean) and Capsella rubella (an 
Arabidopsis comparator), and the externally contributed 
apple (49), strawberry (50) and potato (51) genomes. 
Version 8 is also expected to include genomic variation 
data (SNPs and structural variants) from the JGI and 
elsewhere, and expression data associated with the JGI 
Gene Atlas projects. Phytozome is also in the final 
stages of licensing for distribution to end users. We 
expect that the entire database and software infrastructure 
will be available for download by the end of 2011. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Methods S1-S2, Supplementary References 
[52-57]. 